Abstract

Pedestrian detection is a problem of considerable prac- 
tical interest. Adding to the list of successful applications 
of deep learning methods to vision, we report state-of-the- 
art and competitive results on all major pedestrian datasets 
with a convolutional network model. The model uses a few 
new twists, such as multi-stage features, connections that 
skip layers to integrate global shape information with local 
distinctive motif information, and an unsupervised method 
based on convolutional sparse coding to pre-train the filters 
at each stage. 



1. Introduction 

Pedestrian detection is a key problem for surveillance, 
automotive safety and robotics applications. The wide vari- 
ety of appearances of pedestrians due to body pose, occlu- 
sions, clothing, lighting and backgrounds makes this task 
challenging. 

All existing state-of-the-art methods use a combination 
of hand-crafted features such as Integral Channel Fea- 
tures [ ], HoG [ ] and their variations [12, 31] and com- 
binations [36], followed by a trainable classifier such as 
SVM [12, 26], boosted classifiers [ ] or random forests [7]. 
While low-level features can be designed by hand with good 
success, mid-level features that combine low-level features 
are difficult to engineer without the help of some sort of 
learning procedure. Multi-stage recognizers that learn hier- 
archies of features tuned to the task at hand can be trained 
end-to-end with little prior knowledge. Convolutional Net- 
works (ConvNets) [ ] are examples of such hierarchical 
systems with end-to-end feature learning that are trained 
in a supervised fashion. Recent works have demonstrated 
the usefulness of unsupervised pre-training for end-to-end 
training of deep multi-stage architectures using a variety 
of techniques such as stacked restricted Boltzmann ma- 
chines [15], stacked auto-encoders [ ] and stacked sparse 
auto-encoders [10], and using new types of non-linear
transforms at each layer [16, 19].




Figure 1: 128 9 x 9 filters trained on grayscale INRIA images
using Algorithm 1. It can be seen that in addition to edge detectors
at multiple orientations, our system also learns more complicated
features such as corner and junction detectors.



Supervised ConvNets have been used by a number of authors
for applications such as face and hand detection [35, 27,
14, 29, 13, 34]. More recently, a large ConvNet by [20]
achieved a breakthrough on a 1000-class ImageNet detec- 
tion task. The main contribution of this paper is to show 
that the ConvNet model, with a few important twists, con- 
sistently yields state of the art and competitive results on 
all major pedestrian detection benchmarks. The system 
uses unsupervised convolutional sparse auto-encoders to 
pre-train features at all levels from the relatively small IN- 
RIA dataset [ ], and end-to-end supervised training to train 
the classifier and fine-tune the features in an integrated fash- 
ion. Additionally, multi-stage features with layer-skipping 
connections enable output stages to combine global shape 
detectors with local motif detectors. 

While processing speed is not the focus of this paper, 
orthogonal techniques introduced by [3] can be applied to 
our system for fast pedestrian detection. These techniques 
greatly reduce the number of processed scales compared 
to most systems, also reducing chances of false positives. 
Hence we may experience further accuracy boosts with such 
scaling schemes. 
2. Learning Feature Hierarchies

Much of the work on pedestrian detection has focused
on designing representative and powerful features [5, 9, 8,
36]. In this work, we show that generic feature learning al- 
gorithms can produce successful feature extractors that can 
achieve state-of-the-art results. 

Supervised learning of end-to-end systems on images
has been shown to work well when there are abundant labeled
samples [21], including for detection tasks [35, 27,
14, 29, 13, 34]. However, for many input domains, it is
hard to find an adequate amount of labeled data. In this case,
one can either design useful features using domain
knowledge or use unsupervised learning algorithms.
Recently, unsupervised learning algorithms
have been demonstrated to produce good features for
generic object recognition problems [22, 23, 17, 19].

In [15], it was shown that unsupervised learning can be 
used to train deep hierarchical models and the final repre- 
sentation achieved is actually useful for a variety of differ- 
ent tasks [30, 22, 4]. In this work, we also follow a similar 
approach and train a generic unsupervised model at each 
layer using the output representation from the layer before. 
This process is then followed by supervised updates to the 
whole hierarchical system using label information. 
Figure 2: A subset of 7 x 7 second layer filters trained on
grayscale INRIA images using Algorithm 2. Each row in the figure
shows filters that connect to a common output feature map.
It can be seen that they extract features at similar locations and
shapes.



2.1. Hierarchical Model 

A hierarchical feature extraction system consists of multiple
levels of feature extractors that perform the same filtering
and non-linear transformation functions in successive
layers. Using a particular generic parametrized function, one
can then map the inputs into gradually higher-level (or
more abstract) representations [21, 15, 4, 30, 22]. In this work
we use sparse convolutional feature hierarchies as proposed
in [19]. Each layer of the unsupervised model contains a
convolutional sparse coding algorithm and a predictor function
that can be used for fast inference. After the last layer,
a classifier is used to map the feature representations into
class labels. Neither the sparse coding dictionary nor the
predictor function contains any hard-coded parameter;
both are trained from the input data.

The training procedure for this model is similar to [15].
Each layer is trained separately in an unsupervised manner
using the representation from the previous layer (or the input
image for the initial layer). After the whole multi-stage system
is trained in a layer-wise fashion, the complete architecture
followed by a classifier is fine-tuned using labeled data.

2.2. Unsupervised Learning 

Recently sparse coding has seen much interest in many
fields due to its ability to extract useful feature representations
from data. The general formulation of sparse coding is
a linear reconstruction model using an overcomplete dictionary
$\mathcal{D} \in \mathbb{R}^{n \times m}$, where $m > n$, and a
regularization penalty on the mixing coefficients $z \in \mathbb{R}^m$.

$$z^* = \arg\min_z \|x - \mathcal{D}z\|_2^2 + \lambda s(z) \tag{1}$$

The aim is to minimize equation 1 with respect to $z$ to obtain
the optimal sparse representation $z^*$ that corresponds to the input
$x \in \mathbb{R}^n$. The exact form of $s(z)$ depends on the particular
sparse coding algorithm that is used; here we use the $\|\cdot\|_1$
norm penalty, which is the sum of the absolute values of
all elements of $z$. It is immediately clear that the solution of
this system requires an optimization process. Many efficient
algorithms for solving the above convex system have been
proposed in recent years [1, 6, 2, 24]. However, our aim is
to also learn generic feature extractors. For that reason we
minimize equation 1 with respect to $\mathcal{D}$ too.

$$z^*, \mathcal{D}^* = \arg\min_{z, \mathcal{D}} \|x - \mathcal{D}z\|_2^2 + \lambda \|z\|_1 \tag{2}$$

This resulting equation is non-convex in $\mathcal{D}$ and $z$ jointly;
however, keeping one fixed, the problem is still convex
with respect to the other variable. All sparse modeling algorithms
that adopt the dictionary matrix $\mathcal{D}$ exploit this property
and perform a coordinate-descent-like minimization process
where each variable is updated in succession. Following
[28], many authors have used sparse dictionary learning
to represent images [25, 1, 18]. However, most of the sparse
coding models use small image patches as input $x$ to learn
the dictionary $\mathcal{D}$ and then apply the resulting model to every
overlapping patch location on the full image. This approach
assumes that the sparse representations for two neighboring
patches with a single pixel shift are completely independent,
and thus produces very redundant representations. [19, 37] have
introduced convolutional sparse modeling formulations for
feature learning and object recognition, and we use the
Convolutional Predictive Sparse Decomposition (CPSD) model
proposed in [19] since it is the only convolutional sparse
coding model providing a fast predictor function that is
suitable for building multi-stage feature representations. The
particular predictor function we use is similar to a single
layer ConvNet of the following form:



$$f(x; g, k, b) = \tilde{z} = g \cdot \tanh(x \otimes k + b) \tag{3}$$

where the $\cdot$ operator represents component-wise multiplication
and the $\otimes$ operator represents convolution of $x$ with a set of
filters $k$. In this formulation $x$ is a $p \times p$ grayscale input
image, $k \in \mathbb{R}^{n \times m \times m}$ is a set of 2D filters, each of
size $m \times m$, $g$ and $b$ are vectors with $n$ elements, and the
predictor output $\tilde{z} \in \mathbb{R}^{n \times (p-m+1) \times (p-m+1)}$ is a set
of feature maps, each of size $(p-m+1) \times (p-m+1)$. Considering
this general predictor function, the final form of the convolutional
unsupervised energy is as follows:



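$$E_{CPSD} = E_{ConvSC} + \beta E_{Pred} \tag{4}$$

$$E_{ConvSC} = \|x - \mathcal{D} \otimes z\|_2^2 + \lambda \|z\|_1 \tag{5}$$

$$E_{Pred} = \|z^* - f(x; g, k, b)\|_2^2 \tag{6}$$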



where $\mathcal{D}$ is a dictionary of filters the same size as $k$. The
unsupervised learning procedure is a two-step coordinate
descent process. At each iteration: (1) Inference: the parameters
$W = \{\mathcal{D}, g, k, b\}$ are kept fixed and equation 5 is
minimized to obtain the optimal sparse representation $z^*$;
(2) Update: keeping $z^*$ fixed, the parameters $W$ are updated
using a stochastic gradient step, $W \leftarrow W - \eta \frac{\partial E_{CPSD}}{\partial W}$,
where $\eta$ is the learning rate parameter. The inference
procedure requires solving the sparse coding problem.
For this step we use the FISTA method proposed
in [ ]. This method is an extension of the original iterative
shrinkage and thresholding algorithm [ ] using an improved
step size calculation with a momentum-like term. We apply
the FISTA algorithm in the image domain, adopting the
convolutional formulation.
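
As a rough illustration of the inference step, the following sketch applies FISTA to the dense (non-convolutional) problem of equation 1 with the L1 penalty. It is a simplification and not the convolutional solver used in this work: the function names, the plain-Python matrix helpers and the Frobenius-norm bound used for the step size are all assumptions made for the example.

```python
import math

def soft_threshold(v, tau):
    """Shrinkage operator: proximal map of tau * |.|, applied element-wise."""
    return [math.copysign(max(abs(a) - tau, 0.0), a) for a in v]

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def fista(x, D, lam, n_iter=200):
    """Minimize ||x - D z||_2^2 + lam * ||z||_1 over z (dense sketch)."""
    Dt = transpose(D)
    # Assumption: 2 * ||D||_F^2 is used as a simple upper bound on the
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * sum(a * a for row in D for a in row)
    z = [0.0] * len(D[0])
    y, t = list(z), 1.0
    for _ in range(n_iter):
        residual = [r - xi for r, xi in zip(matvec(D, y), x)]
        grad = [2.0 * gv for gv in matvec(Dt, residual)]
        # Gradient step followed by shrinkage (the ISTA step).
        z_new = soft_threshold([yi - gi / L for yi, gi in zip(y, grad)], lam / L)
        # Momentum-like extrapolation that distinguishes FISTA from ISTA.
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [zn + ((t - 1.0) / t_new) * (zn - zo) for zn, zo in zip(z_new, z)]
        z, t = z_new, t_new
    return z
```

With an identity dictionary the minimizer is known in closed form (each coefficient is the input shrunk by lam / 2), which gives a simple sanity check: `fista([3.0, -0.2], [[1.0, 0.0], [0.0, 1.0]], lam=1.0)` approaches `[2.5, 0.0]`.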

For color images or other multi-modal feature representations,
the input $x$ is a set of feature maps indexed by $i$ and
the representation $z$ is also a set of feature maps indexed
by $j$. We define a map of connections $P$ from input $x$ to
features $z$. Each output feature map is connected to a set $P_j$ of
input feature maps. Thus, the predictor function used
in Algorithm 1 is defined as

$$\tilde{z}_j = g_j \cdot \tanh\Big(\sum_{i \in P_j} (x_i \otimes k_{ij}) + b_j\Big) \tag{7}$$

and the reconstruction is computed using the inverse map
$\bar{P}$:

$$E_{ConvSC} = \sum_i \Big\| x_i - \sum_{j \in \bar{P}_i} \mathcal{D}_{ij} \otimes z_j \Big\|_2^2 + \lambda \|z\|_1 \tag{8}$$

For a fully connected layer, all the input features are connected
to all the output features; however, it is also common
to use sparse connection maps to reduce the number of parameters.
The online training algorithm for unsupervised
training of a single layer is:



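Algorithm 1 Single layer unsupervised training.

function Unsup($x$, $\mathcal{D}$, $P$, $\{\alpha, \beta\}$, $\{g, k, b\}$)
    Set: $f(x; g, k, b)$ from eqn 7, $W = \{g, k, b\}$.
    Initialize: $z = 0$, $\mathcal{D}$ and $W$ randomly.
    repeat
        Perform inference, minimize equation 8 wrt $z$ using FISTA [2]
        Do a stochastic update on $\mathcal{D}$ and $W$:
            $\mathcal{D} \leftarrow \mathcal{D} - \eta \frac{\partial E_{CPSD}}{\partial \mathcal{D}}$ and $W \leftarrow W - \eta \frac{\partial E_{CPSD}}{\partial W}$
    until convergence
    Return: $\{\mathcal{D}, g, k, b\}$
end function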
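
To make the multi-map predictor of equation 7 concrete, here is a minimal plain-Python sketch. The data layout is an assumption chosen for readability (feature maps as lists of rows, filters in a dictionary keyed by `(i, j)`, and `connections[j]` playing the role of the set of inputs connected to output `j`); it is not the implementation used in this work.

```python
import math

def conv2d_valid(x, k):
    """'Valid' 2D convolution of map x with a square filter k (filter flipped)."""
    rows, cols, m = len(x), len(x[0]), len(k)
    out = []
    for r in range(rows - m + 1):
        out_row = []
        for c in range(cols - m + 1):
            acc = 0.0
            for u in range(m):
                for v in range(m):
                    acc += x[r + u][c + v] * k[m - 1 - u][m - 1 - v]
            out_row.append(acc)
        out.append(out_row)
    return out

def predictor(xs, filters, gains, biases, connections):
    """Compute g_j * tanh(sum over connected i of (x_i conv k_ij) + b_j) per output j.

    xs          -- list of input feature maps
    filters     -- dict mapping (i, j) to the filter from input i to output j
    gains       -- one scalar gain g_j per output map
    biases      -- one scalar bias b_j per output map
    connections -- connections[j] lists the (non-empty) inputs feeding output j
    """
    outputs = []
    for j, inputs in enumerate(connections):
        total = None
        for i in inputs:
            resp = conv2d_valid(xs[i], filters[(i, j)])
            if total is None:
                total = resp
            else:
                total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, resp)]
        outputs.append([[gains[j] * math.tanh(a + biases[j]) for a in row]
                        for row in total])
    return outputs
```

For a p x p input and m x m filters each output map has size (p - m + 1) x (p - m + 1), so two 4 x 4 inputs with 3 x 3 filters yield 2 x 2 output maps.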



2.3. Non-Linear Transformations 

Once the unsupervised learning for a single stage is com- 
pleted, the next stage is trained on the feature representation 
from the previous one. In order to obtain the feature repre- 
sentation for the next stage, we use the predictor function 
f(x) followed by non-linear transformations and pooling. 
Following the multi-stage framework used in [19], we ap- 
ply absolute value rectification, local contrast normalization 
and average down-sampling operations. 
Absolute Value Rectification is applied component-wise to
the whole feature output from f(x) in order to avoid cancellation
problems in contrast normalization and pooling steps.
Local Contrast Normalization is a non-linear process that 
enhances the most active feature and suppresses the other 
ones. The exact form of the operation is as follows: 



$$v_i = x_i - x_i \otimes w, \qquad \sigma = \sqrt{\sum_i w \otimes v_i^2} \tag{9}$$

$$y_i = \frac{v_i}{\max(c, \sigma)} \tag{10}$$

where $i$ is the feature map index and $w$ is a 9 x 9 Gaussian
weighting function with normalized weights so that
$\sum_{ipq} w_{pq} = 1$. For each sample, the constant $c$ is set to
$\mathrm{mean}(\sigma)$ in the experiments.

Average Down-Sampling operation is performed using a 
fixed size boxcar kernel with a certain step size. The size 
of the kernel and the stride are given for each experiment in 
the following sections. 
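
The three operations can be sketched in plain Python as follows. This is an illustrative simplification under stated assumptions: it handles a single feature map (so the Gaussian window sums to 1 over that one map), the Gaussian width `sigma` and the zero padding at the borders are choices made for the example, and the guard against a zero denominator is added for safety.

```python
import math

def gaussian_window(size=9, sigma=2.0):
    """size x size Gaussian weights normalized to sum to 1 (sigma is assumed)."""
    half = size // 2
    w = [[math.exp(-((u - half) ** 2 + (v - half) ** 2) / (2.0 * sigma ** 2))
          for v in range(size)] for u in range(size)]
    total = sum(sum(row) for row in w)
    return [[a / total for a in row] for row in w]

def weighted_local_sum(x, w):
    """Filter map x with the symmetric window w, zero-padding the borders."""
    rows, cols, half = len(x), len(x[0]), len(w) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for u, w_row in enumerate(w):
                for v, weight in enumerate(w_row):
                    rr, cc = r + u - half, c + v - half
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[r][c] += weight * x[rr][cc]
    return out

def rectify(x):
    """Absolute value rectification, applied component-wise."""
    return [[abs(a) for a in row] for row in x]

def local_contrast_normalize(x, w):
    """Subtract the local weighted mean, then divide by max(c, local std)."""
    mean = weighted_local_sum(x, w)
    v = [[a - m for a, m in zip(rx, rm)] for rx, rm in zip(x, mean)]
    var = weighted_local_sum([[a * a for a in row] for row in v], w)
    sigma = [[math.sqrt(a) for a in row] for row in var]
    c = sum(sum(row) for row in sigma) / (len(x) * len(x[0]))  # c = mean(sigma)
    return [[a / max(c, s) if max(c, s) > 0.0 else 0.0 for a, s in zip(rv, rs)]
            for rv, rs in zip(v, sigma)]

def average_downsample(x, size, stride):
    """Boxcar averaging over size x size windows taken every `stride` pixels."""
    return [[sum(x[r + u][c + v] for u in range(size) for v in range(size)) / size ** 2
             for c in range(0, len(x[0]) - size + 1, stride)]
            for r in range(0, len(x) - size + 1, stride)]
```

Because `c` is tied to the mean of the local standard deviations, the normalized output is unchanged when the input map is multiplied by a positive constant, which is a convenient property to check.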

Once a single layer of the network is trained, the features
for training a successive layer are extracted using the predictor
function followed by non-linear transformations. The
detailed procedure for training an N-layer hierarchical model
is explained in Algorithm 2.

The first layer features can be easily displayed in the parameter
space since the parameter space and the input space
are the same; however, visualizing the second and higher level
features in the input space is only possible when only
invertible operations are used between layers.
Since we use absolute value rectification and local contrast
normalization operations, mapping the second layer features


Algorithm 2 Multi-layer unsupervised training.

function HierarUnsup($x$, $n_i$, $m_i$, $P_i$, $\{\alpha_i, \beta_i\}$, $\{w_i, s_i\}$, $i = \{1..N\}$)
    Set: $i = 1$, $X_1 = x$, lcn($x$) using equations 9-10,
         ds($X$, $w$, $s$) as the down-sampling operator using a boxcar
         kernel of size $w \times w$ and stride of size $s$ in both directions.
    repeat
        Set: $\mathcal{D}_i, k_i \in \mathbb{R}^{n_i \times m_i \times m_i}$, $g_i, b_i \in \mathbb{R}^{n_i}$.
        $\{\mathcal{D}_i, g_i, k_i, b_i\}$ = Unsup($X_i$, $\mathcal{D}_i$, $P_i$, $\{\alpha_i, \beta_i\}$, $\{g_i, k_i, b_i\}$)
        $z = f(X_i; g_i, k_i, b_i)$ using equation 7.
        $z = |z|$
        $z$ = lcn($z$)
        $X_{i+1}$ = ds($z$, $w_i$, $s_i$)
        $i = i + 1$
    until $i = N$
end function




[Figure 3 diagram labels: input, 1st stage, 2nd stage, classifier]

Figure 3: A multi-scale convolutional network. The top row
of maps constitutes a regular ConvNet [16]. The bottom row,
in which the 1st stage output is branched, subsampled again
and merged into the classifier input, provides a multi-stage
component to the classifier stage. The multi-stage features
fed to the classifier capture a global structure as
well as local details.



onto input space is not possible. In Figure 2 we show a sub- 
set of 1664 second layer features in the parameter space. 

2.4. Supervised Training 

After the unsupervised learning of the hierarchical fea- 
ture extraction system is completed using Algorithm 2, we 
append a classifier function, usually in the form of a linear 
logistic regression, and perform stochastic online training 
using labeled data. 

2.5. Multi-Stage Features 

ConvNets are usually organized in a strictly feed-forward
manner where one layer only takes the output of
the previous layer as input. Features extracted this way tend
to be high level features after a few stages of convolutions
and subsampling. By branching lower levels' outputs into
the top classifier (Fig. 3), one produces features that extract
both global shapes and structures and local details, such as
a global silhouette and face components in the case of human
detection. Contrary to [11], the output of the first stage
is branched after the non-linear transformations and
pooling/subsampling operations rather than before.

We also use color information on the training data. For
this purpose we convert all images into YUV image space
and subsample the UV features by 3 since color information
is in much lower resolution. Then at the first stage, we keep
feature extraction systems for Y and UV channels separate.
On the Y channel, we use 32 7 x 7 features followed by
absolute value rectification, contrast normalization and 3 x 3
subsampling. On the subsampled UV channels, we extract
6 5 x 5 features followed by absolute value rectification and
contrast normalization, skipping the usual subsampling step
since it was performed beforehand. These features are then
concatenated to produce 38 feature maps that are input to
the second layer. The second layer feature extraction takes 38
feature maps and produces 68 output features using 2040
9 x 9 features. A randomly selected 20% of the connections
in the mapping from input features to output features is
removed to limit the computational requirements and break
the symmetry [ ]. The outputs of the second layer
are then transformed using absolute value rectification and
contrast normalization followed by 2 x 2 subsampling. This
results in a 17824-dimensional feature vector for each sample,
which is then fed into a linear classifier.
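
The 17824 figure can be reproduced with a short size calculation. The sketch below is a plausible reading rather than a statement of the actual implementation: it assumes 'valid' convolutions, that the 6 UV maps end up with the same 40 x 24 size as the 32 Y maps (needed for concatenation, though how this is achieved is not stated), and that the skip branch subsamples the first-stage maps by 2 x 2 (a factor not given in the text, chosen because it matches the reported dimension).

```python
def conv_valid(size, kernel):
    """Output size of a 'valid' convolution along one dimension."""
    return size - kernel + 1

def subsample(size, factor):
    """Output size after non-overlapping subsampling along one dimension."""
    return size // factor

height, width = 126, 78                      # network input window (Section 3.1)

# First stage on the Y channel: 7x7 filters, then 3x3 subsampling.
h1 = subsample(conv_valid(height, 7), 3)     # 40
w1 = subsample(conv_valid(width, 7), 3)      # 24
n1 = 32 + 6   # 32 Y maps + 6 UV maps, assumed to share the 40 x 24 size

# Second stage: 9x9 filters, then 2x2 subsampling, 68 output maps.
h2 = subsample(conv_valid(h1, 9), 2)         # 16
w2 = subsample(conv_valid(w1, 9), 2)         # 8
n2 = 68

# Skip branch: first-stage maps subsampled again (2x2 is an assumption).
hb, wb = subsample(h1, 2), subsample(w1, 2)  # 20, 12

feature_dim = n2 * h2 * w2 + n1 * hb * wb
print(feature_dim)  # 17824
```

Under these assumptions the second stage contributes 68 x 16 x 8 = 8704 values and the branched first stage contributes 38 x 20 x 12 = 9120 values.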

In Table 1, we show that multi-stage features improve accuracy
for different tasks, with different magnitudes. The greatest
improvements are obtained for pedestrian detection and
traffic-sign classification while only minimal gains are obtained
for house numbers classification, a less complex task.
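
The improvement column of Table 1 is the relative reduction in error rate, which can be checked with a few lines of Python. The function name and the dictionary layout are only illustrative; the error rates are the ones reported in the table.

```python
def relative_improvement(single_stage_error, multi_stage_error):
    """Relative error reduction, in percent."""
    return 100.0 * (single_stage_error - multi_stage_error) / single_stage_error

# (single-stage error %, multi-stage error %) as reported in Table 1.
table_1 = {
    "INRIA": (14.26, 9.85),
    "GTSRB": (1.80, 0.83),
    "SVHN": (5.54, 5.36),
}

for task, (single, multi) in table_1.items():
    print(f"{task}: {relative_improvement(single, multi):.1f}%")
```

This gives roughly 30.9%, 53.9% and 3.2%, consistent with the rounded 31%, 54% and 3.2% in the table.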

2.6. Bootstrapping 

Bootstrapping is typically used in detection settings by
extracting the most offending negative answers and adding
these samples multiple times to the existing dataset during
training. For this purpose, we extract 3000 negative samples
per bootstrapping pass and limit the number of most
offending answers to 5 for each image. We perform 3
bootstrapping passes in addition to the original training phase
(i.e. 4 training passes in total).
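
The selection rule for one bootstrapping pass can be sketched as follows. This is a hypothetical helper, not the code used in the experiments: it assumes that "most offending" means the highest classifier score among detections on negative images, and it represents each false positive as an `(image_id, score, window)` tuple.

```python
def select_hard_negatives(false_positives, per_image=5, per_pass=3000):
    """Pick the highest-scoring false positives, capped per image and per pass.

    false_positives -- iterable of (image_id, score, window) tuples
    """
    kept, count_per_image = [], {}
    ranked = sorted(false_positives, key=lambda fp: fp[1], reverse=True)
    for image_id, score, window in ranked:
        if len(kept) == per_pass:
            break  # the per-pass budget is exhausted
        if count_per_image.get(image_id, 0) < per_image:
            count_per_image[image_id] = count_per_image.get(image_id, 0) + 1
            kept.append((image_id, score, window))
    return kept
```

The defaults mirror the values given above (5 per image, 3000 per pass); the selected windows would then be added to the training set before the next training pass.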

2.7. Non-Maximum Suppression 

Non-maximum suppression (NMS) is used to resolve
conflicts when several bounding boxes overlap. For both
INRIA and Caltech experiments we use the widely accepted
PASCAL overlap criterion to determine a matching score
between two bounding boxes ($\frac{\text{intersection}}{\text{union}}$), and if two boxes
overlap by more than 60%, only the one with the highest
score is kept. In [10]'s addendum, the matching criterion is
modified by replacing the union of the two boxes with the
minimum of the two. Therefore, if a box is fully contained
in another one the small box is selected. The goal of this
modification is to avoid false positives that are due to pedestrian
body parts. However, a drawback of this approach
is that it always disregards one of the overlapping pedestrians
from detection. Instead of changing the criterion, we
actively modify our training set before each bootstrapping
phase. We include body part images that cause false positive
detections in our bootstrapping image set. Our model
can then learn to suppress such responses within a positive
window and still detect pedestrians within bigger windows
more reliably.

Task                                        Single-Stage features   Multi-Stage features   Improvement %
Pedestrians detection (INRIA)               14.26%                  9.85%                  31%
Traffic Signs classification (GTSRB) [33]   1.80%                   0.83%                  54%
House Numbers classification (SVHN) [32]    5.54%                   5.36%                  3.2%

Table 1: Error rate improvements of multi-stage features over single-stage features for different types of object detection and
classification. Improvements are significant for multi-scale and textured objects such as traffic signs and pedestrians but minimal for house
numbers.
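
The overlap test used for non-maximum suppression in Section 2.7 can be sketched as below. The greedy, score-ordered loop is an assumption (the text only states that the highest-scoring box of an overlapping pair is kept), and boxes are represented as `(x1, y1, x2, y2)` tuples for the example. The `use_min` flag switches to the variant that divides by the smaller box area instead of the union.

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def overlap(a, b, use_min=False):
    """PASCAL overlap (intersection / union), or intersection / min area."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    denom = min(area(a), area(b)) if use_min else area(a) + area(b) - inter
    return inter / denom if denom > 0.0 else 0.0

def non_max_suppression(detections, threshold=0.6, use_min=False):
    """Greedy NMS over (score, box) pairs, highest score first."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep the box only if it does not overlap a kept box too much.
        if all(overlap(box, kept_box, use_min) <= threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

A small box fully contained in a large one illustrates the difference between the two criteria: for `(0, 0, 10, 10)` and `(2, 2, 6, 6)` the PASCAL overlap is 0.16, so both boxes survive at a 60% threshold, whereas the minimum-area variant gives 1.0 and one of the two is suppressed.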

3. Experiments 

We evaluate our system on the two most common pedestrian
detection benchmark datasets: INRIA and Caltech.
However, for training we only use the INRIA dataset. We also
show experiments that demonstrate the improvements coming
from unsupervised training and from using color and
multi-resolution features. In the following we name our model
CN-MRC ("ConvNet - Multi Resolution and Color"), with
two variants. CN-MRC-Unsup represents the model that is
initialized using hierarchical unsupervised training and then
fine-tuned using labeled data. CN-MRC-Supervised represents
the model that is initialized with random weights and
only trained using labeled information.

3.1. Data Preparation 

The ConvNet is trained on the INRIA pedestrian
dataset [5]. Pedestrians are extracted into windows of 126
pixels in height and 78 pixels in width. The context ratio
is 1.4, i.e. pedestrians are 90 pixels high and the remaining
36 pixels correspond to the background. Each pedestrian
image is mirrored along the horizontal axis to expand the
dataset. Similarly, we add 5 variations of each original sample
using 5 random deformations such as translations and
scale. Translations range from -2 to 2 pixels and scale ratios
from 0.95 to 1.05. These deformations enforce invariance
to small deformations in the input. The range of each
deformation determines the trade-off between recognition and
localization accuracy during detection. An equal number
of background samples is extracted at random from the
negative images. Taking approximately 10% of the extracted
samples for validation yields a validation set with
2000 samples and a training set with 21845 samples.
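
As an illustration of the augmentation described above, the sketch below samples deformation parameters within the stated ranges and mirrors an image. It makes several assumptions that the text does not settle: translations are drawn as whole pixels, both quantities are sampled uniformly, and mirroring is taken to mean a left-right flip. The function names are hypothetical.

```python
import random

def deformation_parameters(n_variations=5, max_shift=2,
                           scale_range=(0.95, 1.05), seed=None):
    """Sample (dx, dy, scale) triples for the random deformations of one sample."""
    rng = random.Random(seed)
    return [
        {
            "dx": rng.randint(-max_shift, max_shift),  # horizontal shift in pixels
            "dy": rng.randint(-max_shift, max_shift),  # vertical shift in pixels
            "scale": rng.uniform(*scale_range),        # scale ratio
        }
        for _ in range(n_variations)
    ]

def mirror(image):
    """Left-right flip of an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]

# The context ratio follows from the window and pedestrian heights.
context_ratio = 126 / 90
```

Each sampled triple would then be applied to the original window with an image-warping routine of choice before adding the result to the training set.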

3.2. Evaluation Protocol 

During testing and bootstrapping phases using the INRIA
dataset, the images are both up-sampled and sub-sampled.
The up-sampling ratio is 1.3 while the sub-sampling
ratio is limited by 0.75 times the network's minimum
input (126 x 78). We use a scale stride of 1.10 between
each scale, while other methods typically use either 1.05 or
1.20 [10]. A higher scale stride is desirable as it implies fewer
computations.
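
The effect of the scale stride on the amount of computation can be seen in a small sketch that enumerates the scales of the image pyramid. The stopping rule is an interpretation, not something the text spells out: scales start at the up-sampling ratio and are divided by the stride until the rescaled image would be smaller than 0.75 times the network input in either dimension. The function name and argument names are hypothetical.

```python
def pyramid_scales(image_h, image_w, max_scale=1.3, stride=1.10,
                   net_h=126, net_w=78, min_ratio=0.75):
    """List the scales visited, from max_scale downward in steps of `stride`."""
    scales, s = [], max_scale
    # Assumed stopping rule: keep a scale while the rescaled image is still
    # at least min_ratio times the network input in both dimensions.
    while image_h * s >= min_ratio * net_h and image_w * s >= min_ratio * net_w:
        scales.append(s)
        s /= stride
    return scales

for stride in (1.05, 1.10, 1.20):
    print(stride, len(pyramid_scales(480, 640, stride=stride)))
```

Under this reading, a stride of 1.10 visits fewer scales than 1.05 and more than 1.20 for the same image, which is the trade-off the paragraph describes.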

For evaluation we use the bounding-box files published
on the Caltech Pedestrian website (footnote 1) and the evaluation
software provided by Piotr Dollar (version 3.0.1). In an effort
to provide a more accurate evaluation, we improved on
both the evaluation formula and the INRIA annotations as
follows. The evaluation software was slightly modified to
compute the full area under curve (AUC) over the entire [0, 1]
range rather than from 9 discrete points only (0.01, 0.0178,
0.0316, 0.0562, 0.1, 0.1778, 0.3162, 0.5623 and 1.0 in version
3.0.1): we sum the areas under the piece-wise linear
interpolation of the curve between each pair of consecutive
points. In addition, we also report results on a 'fixed' version
of the annotations for the INRIA dataset, whose original
annotations have missing positive labels. The
added labels are only used to avoid counting false errors and
wrongly penalizing algorithms. The modified code and extra
INRIA labels are available online (footnote 2). Table 2 reports
results for both the original and fixed INRIA datasets. Notice that the
full AUC and the fixed INRIA annotations both yield a reordering
of the results.
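
The full-AUC computation amounts to the trapezoid rule applied to the DET curve. The sketch below is a minimal illustration rather than the modified evaluation code itself: it assumes the curve is given as `(fppi, miss_rate)` pairs that cover the integration range and that interpolation is linear on a linear FPPI axis. It also shows that the 9 reference points listed above are evenly spaced in log space.

```python
def det_auc(points, lo=0.0, hi=1.0):
    """Area under a piece-wise linear miss-rate curve for FPPI in [lo, hi].

    points -- (fppi, miss_rate) pairs; they must cover [lo, hi].
    """
    pts = sorted(points)

    def miss_rate_at(f):
        # Linear interpolation between the two curve points surrounding f.
        for (f0, m0), (f1, m1) in zip(pts, pts[1:]):
            if f0 <= f <= f1:
                return m0 if f1 == f0 else m0 + (m1 - m0) * (f - f0) / (f1 - f0)
        raise ValueError("curve does not cover the requested FPPI")

    inner = [(f, m) for f, m in pts if lo < f < hi]
    clipped = [(lo, miss_rate_at(lo))] + inner + [(hi, miss_rate_at(hi))]
    # Sum of trapezoid areas between consecutive points.
    return sum((f1 - f0) * (m0 + m1) / 2.0
               for (f0, m0), (f1, m1) in zip(clipped, clipped[1:]))

# The 9 reference points of version 3.0.1: 10^-2 to 10^0 in steps of 0.25.
reference_fppi = [10 ** (-2 + 0.25 * i) for i in range(9)]
```

For example, a curve falling linearly from a miss rate of 1 at 0 FPPI to 0 at 1 FPPI has an area of 0.5.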

To ensure a fair comparison, we separated systems
trained on INRIA (the majority) from systems trained on
TUD-MotionPairs and the only system trained on Caltech
in Table 2. For clarity, only systems trained on INRIA are
shown in Figure 5; however, all results for all systems
are still reported in Table 2.

3.3. Results 

In Figure 4, we plot DET curves, i.e. miss rate ver- 
sus false positives per image (FPPI), on the fixed INRIA 
dataset and rank algorithms along two measures: the error 
rate at 1 FPPI and the area under curve (AUC) rate in the 
[0, 1] FPPI range. This graph shows the individual con- 
tributions of unsupervised learning (ConvNet-Unsup) and 
multi-stage features learning (ConvNet-MRC-Supervised) 
and their combination (ConvNet-MRC-Unsup) compared 
to the fully-supervised system without multi-stage features 
(ConvNet-Supervised). Considering the AUC measure, un- 

1 http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
2 http://cs.nyu.edu/~sermanet/data.html#inria




Area Under Curve [0 : 1] FPPI
Shapelet-orig (94.71%)
PoseInvSvm (79.04%)
PoseInv (72.02%)
Shapelet (65.09%)
VJ-OpenCv (62.35%)
VJ (57.94%)
FtrMine (44.36%)
HOG (33.52%)
Pls (30.49%)
HikSvm (30.13%)
LatSvm-V1 (28.20%)
ConvNet-Supervised (26.05%)
MultiFtr (22.76%)
ConvNet-MRC-Supervised (20.43%)
ConvNet-Unsup (17.81%)
MultiFtr+CSS (16.18%)
LatSvm-V2 (14.39%)
FPDW (13.17%)
ChnFtrs (12.92%)
ConvNet-MRC-Unsup (11.05%)

1.0 FPPI
Shapelet-orig (91.13%)
PoseInvSvm (68.76%)
PoseInv (55.01%)
VJ-OpenCv (52.97%)
Shapelet (50.25%)
VJ (47.37%)
FtrMine (33.96%)
Pls (23.26%)
HOG (22.58%)
HikSvm (20.54%)
LatSvm-V1 (16.81%)
MultiFtr (15.11%)
ConvNet-Supervised (14.26%)
MultiFtr+CSS (10.70%)
ConvNet-Unsup (10.19%)
ConvNet-MRC-Supervised (9.85%)
FPDW (9.34%)
LatSvm-V2 (8.66%)
ChnFtrs (8.66%)
ConvNet-MRC-Unsup (6.62%)



Figure 4: DET curves on the INRIA dataset (with complete annotations) report false positives per image (FPPI) against miss rate. Algorithms
are sorted from top to bottom using 2 metrics: on the left is the area under curve (AUC) between 0 and 1 FPPI, on the right is the miss
rate at 1 FPPI. This graph shows the individual contributions of unsupervised learning (ConvNet-Unsup) and multi-stage features learning
(ConvNet-MRC-Supervised) and their combination (ConvNet-MRC-Unsup) compared to the fully-supervised system without multi-stage
features (ConvNet-Supervised). This graph was produced using the open-source VisionGrader tool; we also report more results below
using Piotr Dollar's evaluation tool.



supervised learning exhibits the most improvements with 
17.81% error compared to the baseline ConvNet (26.05%). 
Multi-stage features without unsupervised learning reach 
20.43% error while their combination yields the state of the
art error rate of 11.05%.

An extensive comparison of results on all major pedestrian
datasets and published systems is provided in Table 2.
Multiple types of measures proposed by [10] are reported. For
clarity, we also plot in Figure 5 two of these measures,
'reasonable' and 'large', for INRIA-trained systems. The
'large' plot shows that the ConvNet results in state-of-the-art
performance with some margin on the ETH, Caltech and
TudBrussels datasets and is closely behind LatSvm-V2 and
VeryFast for the INRIA and Daimler datasets. In the 'reasonable'
plot, the ConvNet yields competitive results for the INRIA,
Daimler and ETH datasets but performs poorly on the
Caltech dataset. We suspect the ConvNet with multi-stage
features trained at high resolution is more sensitive to
resolution loss than other methods. In future work, a ConvNet
trained at multiple resolutions will likely learn to use
appropriate cues for each resolution regime.

4. Discussion 

We have introduced a new feature learning model with
an application to pedestrian detection. Contrary to popular
models where the low-level features are hand-designed,
our model learns all the features at all levels in a hierarchy.
We used the method of [19] as a baseline, and extended
it by combining high and low resolution features
in the model, and by learning features on the color channels
of the input. Using the INRIA dataset, we have shown
that these improvements provide clear performance benefits.
The resulting model provides state of the art or competitive
results on most measures of all publicly available
datasets. Small-scale pedestrian measures can be improved
in future work by training multiple scale models relying less
on high-resolution details. Future work should also investigate
speed-up techniques by [ ] which may also improve
performance simply by reducing the number of scales to
process, hence reducing chances of false positives.
Figure 5: AUC percentage of all INRIA-trained systems on all major datasets. The AUC is computed from DET curves (a smaller AUC
means higher accuracy and fewer false positives). For clarity, each ConvNet performance is connected by dotted lines. Only the 'reasonable'
and 'large' measures are plotted here; however, all measures are reported in Table 2. The ConvNet system yields state-of-the-art or competitive
results on most datasets and measures, except for the low-resolution measures on the Caltech dataset because of higher reliance on
high-resolution cues than other methods.
Table 2: This table reports the performance of all systems on all datasets using the full AUC percentage over the considered range [0,1]
from DET curves. DET curves plot false positives per image (FPPI) against miss rate. Hence a smaller AUC% means a more accurate 
system with greater reduction of false positives. Top performing results (INRIA-trained only) are highlighted in bold. We report the 
multiple measures introduced by [ ] for all major pedestrian datasets. The far, occlusion and aspect-ratio measures are only available for 
the Caltech dataset. 